“A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon.” http://www.stat.yale.edu/Courses/1997-98/101/ranvar.html
Two types: discrete and continuous.

A discrete random variable, e.g. the outcome of a coin flip:

\[ X = \begin{cases} 1 & \text{heads} \\ 2 & \text{tails} \end{cases} \]

A continuous random variable, e.g.:

Y = time it takes to run a mile
“A probability distribution is… an equation that links each outcome of a statistical experiment with its probability of occurrence.” http://stattrek.com/probability-distributions/probability-distribution.aspx
Suppose you flip a coin 2 times.
Possible outcomes: HH, HT, TH and TT
E.g. what is the probability distribution of the number of heads?
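The two-flip distribution can be checked in R, either by enumerating the four equally likely outcomes or directly from the binomial distribution (a small sketch; the variable names are illustrative):

```r
# Enumerate the four equally likely outcomes of two fair coin flips
flips <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                     stringsAsFactors = FALSE)
heads <- rowSums(flips == "H")        # number of heads in each outcome
probs <- table(heads) / nrow(flips)   # probability distribution of X = number of heads
probs
# The same distribution directly, since X is Binomial(2, 0.5):
dbinom(0:2, size = 2, prob = 0.5)
```

Both give P(X = 0) = 0.25, P(X = 1) = 0.50, P(X = 2) = 0.25.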
“A cumulative probability refers to the probability that the value of a random variable falls within a specified range.” https://stattrek.com/statistics/dictionary.aspx?definition=cumulative_probability
Probability that 2 coin flips includes 1 or fewer heads?
\[ P(X \le 1) = P(X = 0) + P(X = 1) = 0.25 + 0.50 = 0.75 \]
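The same cumulative probability comes from R's binomial CDF:

```r
# P(X <= 1) for X ~ Binomial(2, 0.5): at most one head in two fair flips
pbinom(1, size = 2, prob = 0.5)
# [1] 0.75
```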
Bin (interval) width = range / number of bins. Conversely, the number of bins k needed to cover the range with bins of width h is:
\[ k = \left\lceil \frac{\max x - \min x}{h} \right\rceil . \]
This formula is used to calculate the number of bins or intervals needed to divide a range of data, often used in creating histograms.
In simple terms:
The result, k, tells you how many bins of width h you would need to cover the entire range of your data.
For example, if your data ranges from 10 to 45, and you want each bin to be 5 units wide, you would calculate: k = ceiling((45 - 10)/5) = ceiling(7) = 7
So you would need 7 bins to cover the entire range of data with bins of width 5.
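The worked example as one line of R:

```r
h <- 5                        # chosen bin width
k <- ceiling((45 - 10) / h)   # bins needed to cover the range 10..45
k
# [1] 7
```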
Histograms can change depending on the number of bins
http://www.shodor.org/interactivate/activities/Histogram/
Square-root choice
\[ k = \sqrt n , \]
Sturges’ formula
\[ k = \lceil \log_2 n \rceil + 1, \]
Freedman-Diaconis
Rice
Scott
…
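Several of these rules ship with base R's grDevices package as `nclass.*()` functions, and `hist()` accepts them by name through its `breaks` argument (a quick sketch on simulated data):

```r
set.seed(42)
x <- rnorm(100)
nclass.Sturges(x)   # Sturges: ceiling(log2(100)) + 1 = 8
nclass.scott(x)     # Scott's rule (depends on the data's spread)
nclass.FD(x)        # Freedman-Diaconis (depends on the IQR)
# e.g. hist(x, breaks = "FD")
```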
Distribution of times it took you to drive to work (average=50, SD=20)
What is the probability that it will take you 50 minutes?
P(Y = 50). For a continuous variable, the probability of any single exact value is 0, so we instead ask about an interval.
P(49 < Y < 51)
## (startup messages from loading 'tigerstats' and its dependencies (abd, nlme, lattice, grid, mosaic) omitted; mosaic masks several base and stats functions. Package website: http://homerhanumat.github.io/tigerstats)
## [1] 0.03987761
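The value above was presumably produced with a tigerstats helper; base R gives the same answer as a difference of the normal CDF:

```r
# P(49 < Y < 51) when Y ~ Normal(mean = 50, sd = 20)
pnorm(51, mean = 50, sd = 20) - pnorm(49, mean = 50, sd = 20)
# [1] 0.03987761
```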
Exact distribution vs. using an approximate distribution
“If the sample is large (n>=30) then statistical theory says that the sample mean is normally distributed… This is a result of a famous statistical theorem, the Central limit theorem.” Source: https://en.wikibooks.org/wiki/Statistics/Testing_Data/t-tests
\[ \text{Area under normal curve} = \int_{-\infty}^{\infty} f(x)\,dx = 1 \]
https://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG
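The 68-95-99.7 percentages shown in that empirical-rule figure fall straight out of the standard normal CDF:

```r
# Probability of landing within 1, 2, and 3 SDs of the mean
c(within_1_sd = pnorm(1) - pnorm(-1),
  within_2_sd = pnorm(2) - pnorm(-2),
  within_3_sd = pnorm(3) - pnorm(-3))
# approximately 0.683, 0.954, 0.997
```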
mean=50, SD=25
## [1] 0.2118554
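The printed value 0.2118554 equals pnorm(-0.8), which with mean 50 and SD 25 is P(Y < 30); I take that to be the query on this slide, since the original code is not shown (an inference, not the author's confirmed call):

```r
# P(Y < 30) when Y ~ Normal(mean = 50, sd = 25), i.e. P(Z < -0.8)
pnorm(30, mean = 50, sd = 25)
# [1] 0.2118554
```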
16z + 50
14z + 50
10z + 50
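Each of these expressions rescales a standard normal z to a variable with mean 50 and SD 16, 14, or 10 respectively. A quick simulation check for the first one (variable names are illustrative):

```r
set.seed(1)
z <- rnorm(1e5)     # standard normal: mean 0, SD 1
x <- 16 * z + 50    # rescaled: mean ~50, SD ~16
c(mean(x), sd(x))
```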
http://stattrek.com/statistics/charts/data-patterns.aspx?Tutorial=AP
Example: what about a salary distribution?

E.g. the mean is a poor summary of a skewed variable.

Measure of the tailedness and peakedness of a distribution (kurtosis).
Tutorial on other distributions
\[ \text{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2\,p_i \]
m = number of classes
p_i = proportion of observations in class i, e.g. the fraction of people in an age group earning below the salary threshold
|  | Salary <60k | Salary >60k |
|---|---|---|
| Aged <30 | 30 | 2 |
| Aged >30 | 8 | 22 |
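The row entropies worked out by hand below can be computed with a small helper (an illustrative function, not from the slides). Note that an ID3-style "net" entropy would weight each row by its size (32/62 and 30/62), giving about 0.579 rather than the simple mean:

```r
# Shannon entropy of a vector of class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # skip empty classes so log2(0) never occurs
  -sum(p * log2(p))
}
entropy(c(30, 2))    # aged <30 row: 0.3372901
entropy(c(8, 22))    # aged >30 row: 0.8366407
# Size-weighted average (32 and 30 of the 62 people):
(32/62) * entropy(c(30, 2)) + (30/62) * entropy(c(8, 22))
```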
# Entropy(age<30) = -((30/32 * log2(30/32)) + (2/32 * log2(2/32))) = 0.3372901
# Entropy(age>30) = -((8/30 * log2(8/30)) + (22/30 * log2(22/30))) = 0.8366407
# Net entropy = mean(c(0.3372901, 0.8366407)) = 0.5869654

Open rundata.csv on BBL
Open in Excel
# if issues run the following:
# getwd()
# setwd("./week2")
rundata <- read.csv("../rundata.csv")
rundata$FirstRun <- rnorm(99, 30, 10)
rundata$SecondRun <- rundata$FirstRun + rnorm(99, 5, 2)
rundata$FinalRun <- rundata$SecondRun + rnorm(99, 7, 3)

Let’s assume each row is a runner and each number is how far they ran at each consecutive run.
## [1] 40.966650 45.223559 32.468533 29.365796 28.858644 39.200772 27.734068
## [8] 35.955872 9.902974 31.708646 17.905562 33.180694 31.617578 14.039046
## [15] 32.446460 17.564528 10.667120 28.700671 25.798031 35.870006 32.021256
## [22] 21.081807 42.462683 51.463381 23.016385 41.131854 50.179608 26.932851
## [29] 19.067766 40.593998 18.540404 26.631558 35.487406 37.473441 31.870145
## [36] 39.004494 32.993627 32.989987 31.386159 33.674469 17.098803 25.911280
## [43] 32.187033 30.953739 24.043190 41.283364 23.332843 23.645131 29.721161
## [50] 39.316774 21.838659 18.922503 41.129746 33.724212 13.533743 50.418537
## [57] 29.754818 44.520191 36.116746 32.424681 29.944536 45.811679 29.995593
## [64] 23.090961 18.170484 34.692438 43.174329 43.076396 29.606398 36.342192
## [71] 30.503618 28.101076 36.236813 45.893840 31.282998 20.571962 21.514469
## [78] 27.919763 50.366440 45.969079 19.621275 22.966757 33.405512 16.280908
## [85] 25.926971 13.372470 42.006253 26.277342 37.438542 15.214066 29.914108
## [92] 17.337447 21.746915 38.108213 31.774232 43.165166 33.526764 27.811286
## [99] 26.837567
## [1] 30.7278
## [1] 9.613917
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.903 23.489 31.283 30.728 36.890 51.463
## 0.1281879 with absolute error < 3.1e-06
Shapiro-Wilk test of normality.
##
## Shapiro-Wilk normality test
##
## data: rundata$FirstRun
## W = 0.98797, p-value = 0.5136
Hypothesis test:

\[ H_0~\text{(null hypothesis): the data are normally distributed} \]

\[ H_1~\text{(alternative hypothesis): the data are not normally distributed} \]

Since p = 0.5136 > 0.05, we fail to reject the null hypothesis: the sample is consistent with a normal distribution.
plot(density(rundata$FirstRun), col="red")
lines(density(rundata$SecondRun), col="green")
legend("bottom", legend = c("First run","Second run"), fill=c("red","green"))

time <- c(19.09, 19.55, 17.89, 17.73, 25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 15.71, 22.04, 22.60, 24.25)

## [1] -0.01565162
## [1] 2.301051
https://www.r-bloggers.com/measures-of-skewness-and-kurtosis/
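The skewness and kurtosis values above presumably come from a package function (e.g. moments or e1071, as in the linked post); the moment-based definitions are easy to compute by hand, a sketch with illustrative helper names:

```r
# Moment-based skewness and (non-excess) kurtosis
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
kurtosis <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2
x <- c(1, 2, 3, 4, 5)   # a perfectly symmetric toy vector
skewness(x)             # 0
kurtosis(x)             # 1.7 (a normal distribution has kurtosis 3)
```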
## [1] -0.01693669
## [1] 3.062924
#Histogram
library(ggplot2)
# 'sample' is the vector of simulated values created earlier
datasim <- data.frame(sample)
ggplot(datasim, aes(x = sample)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, fill = 'red', alpha = 0.5) +
  geom_density(colour = 'blue') +
  xlab(expression(bold('Simulated Samples'))) +
  ylab(expression(bold('Density')))

Note: `binwidth` belongs in `geom_histogram()`, not in the `ggplot()` call, and `..density..` was deprecated in ggplot2 3.4.0 in favour of `after_stat(density)`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.70 18.65 19.88 20.00 21.16 24.91
Other exercises
https://www.r-bloggers.com/normal-distribution-functions/
https://www.r-bloggers.com/measures-of-skewness-and-kurtosis/